1 Iris dataset presentation

The Iris dataset, also known as Fisher's Iris or Anderson's Iris, is a multivariate dataset introduced in 1936 by Ronald Fisher in his paper The use of multiple measurements in taxonomic problems as an example of the application of linear discriminant analysis [1]. The data were collected by Edgar Anderson to quantify variation in the morphology of iris flowers of three species [1]. Two of the three species were collected in the Gaspé Peninsula. "All are from the same field, picked on the same day and measured on the same day by the same person with the same measuring tools" [1].

The data set includes 50 samples of each of the three iris species (Iris setosa, Iris virginica and Iris versicolor). Four characteristics were measured from each sample: length and width of sepals and petals, in centimetres. Based on the combination of these four variables, Fisher developed a linear discriminant analysis model to distinguish between the species.

[1] Fisher's Iris - Wikipedia
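
Fisher's discriminant analysis mentioned above can be reproduced in a few lines of R; a minimal sketch using MASS::lda (note: the MASS package is an assumption here, as it is not among the packages loaded in this report):

```r
library(MASS)  # assumption: MASS is installed; it is not loaded elsewhere in this report

# Fisher's linear discriminant analysis on the four measurements
res.lda <- lda(Species ~ ., data = iris)

# In-sample confusion table of the LDA predictions against the true species
table(iris$Species, predict(res.lda)$class)
```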

In this project, I:

  1. found the best algorithm to predict the species

Skills developed: Visualization, Supervised machine learning, Unsupervised machine learning

1.1 Load libraries
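
The code chunk is folded in the rendered report; based on the packages listed in section 1.10, a plausible setup chunk is:

```r
library(tidyverse)   # dplyr, ggplot2, tidyr, purrr, ...
library(skimr)       # skim()
library(gtsummary)   # tbl_summary()
library(corrplot)    # corrplot()
library(FactoMineR)  # PCA()
library(factoextra)  # fviz_nbclust()
library(tidymodels)  # parsnip, rsample, recipes, yardstick, ...
library(kknn)        # engine for nearest_neighbor()
library(ranger)      # engine for rand_forest()
library(plotly)      # ggplotly()
library(patchwork)   # combining ggplots with + and /
```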

1.2 Load datasets
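
This chunk is also folded; iris ships with base R (the datasets package), so loading it only takes:

```r
# The iris data frame: 150 rows, four numeric measurements plus the Species factor
data(iris)
str(iris)
```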

1.3 Univariate Description

1.3.1 Graphical description

iris %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(x=key, y=value, fill=key)) + 
    geom_boxplot() +
  labs(title = "Boxplot of each numeric variable",
       x = "variables")

1.3.2 Numeric description

skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃
iris %>%
  group_by(Species) %>%
  skim()
Data summary
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
numeric 4
________________________
Group variables Species

Variable type: numeric

skim_variable Species n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length setosa 0 1 5.01 0.35 4.3 4.80 5.00 5.20 5.8 ▃▃▇▅▁
Sepal.Length versicolor 0 1 5.94 0.52 4.9 5.60 5.90 6.30 7.0 ▂▇▆▃▃
Sepal.Length virginica 0 1 6.59 0.64 4.9 6.23 6.50 6.90 7.9 ▁▃▇▃▂
Sepal.Width setosa 0 1 3.43 0.38 2.3 3.20 3.40 3.68 4.4 ▁▃▇▅▂
Sepal.Width versicolor 0 1 2.77 0.31 2.0 2.52 2.80 3.00 3.4 ▁▅▆▇▂
Sepal.Width virginica 0 1 2.97 0.32 2.2 2.80 3.00 3.18 3.8 ▂▆▇▅▁
Petal.Length setosa 0 1 1.46 0.17 1.0 1.40 1.50 1.58 1.9 ▁▃▇▃▁
Petal.Length versicolor 0 1 4.26 0.47 3.0 4.00 4.35 4.60 5.1 ▂▂▇▇▆
Petal.Length virginica 0 1 5.55 0.55 4.5 5.10 5.55 5.88 6.9 ▃▇▇▃▂
Petal.Width setosa 0 1 0.25 0.11 0.1 0.20 0.20 0.30 0.6 ▇▂▂▁▁
Petal.Width versicolor 0 1 1.33 0.20 1.0 1.20 1.30 1.50 1.8 ▅▇▃▆▁
Petal.Width virginica 0 1 2.03 0.27 1.4 1.80 2.00 2.30 2.5 ▂▇▆▅▇

1.3.3 Table one

iris %>%
  tbl_summary(by = Species,
              type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c("{median} ({p25}-{p75})", "{mean} ({sd})")) %>%
  add_overall(last = TRUE) %>%
  add_stat_label()
Characteristic setosa, N = 50 versicolor, N = 50 virginica, N = 50 Overall, N = 150
Sepal.Length
    Median (25%-75%) 5.00 (4.80-5.20) 5.90 (5.60-6.30) 6.50 (6.23-6.90) 5.80 (5.10-6.40)
    Mean (SD) 5.01 (0.35) 5.94 (0.52) 6.59 (0.64) 5.84 (0.83)
Sepal.Width
    Median (25%-75%) 3.40 (3.20-3.68) 2.80 (2.53-3.00) 3.00 (2.80-3.18) 3.00 (2.80-3.30)
    Mean (SD) 3.43 (0.38) 2.77 (0.31) 2.97 (0.32) 3.06 (0.44)
Petal.Length
    Median (25%-75%) 1.50 (1.40-1.58) 4.35 (4.00-4.60) 5.55 (5.10-5.88) 4.35 (1.60-5.10)
    Mean (SD) 1.46 (0.17) 4.26 (0.47) 5.55 (0.55) 3.76 (1.77)
Petal.Width
    Median (25%-75%) 0.20 (0.20-0.30) 1.30 (1.20-1.50) 2.00 (1.80-2.30) 1.30 (0.30-1.80)
    Mean (SD) 0.25 (0.11) 1.33 (0.20) 2.03 (0.27) 1.20 (0.76)

1.4 Multivariate description

1.4.1 Correlation matrix

Important: only numeric variables can be used in PCA.

corrplot(round(cor(select_if(iris, is.numeric)),2), 
         type="upper", 
         order="hclust", 
         tl.col="black", 
         tl.srt=45)

1.4.2 Principal component analysis

res.pca <- PCA(select_if(iris, is.numeric), graph = FALSE)
plot(res.pca, choix = "var")

jpeg("images/pca_plot.jpg")
plot(res.pca, choix = "var")
dev.off()
## png 
##   2
barplot(res.pca$eig[, 2], names.arg=1:nrow(res.pca$eig), 
       main = "Variances",
       xlab = "Principal Components",
       ylab = "Percentage of variances",
       col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(res.pca$eig), res.pca$eig[, 2], 
      type="b", pch=19, col = "red")

Interpretation: Sepal.Length, Petal.Width and Petal.Length are highly correlated: knowing one of the three variables gives a fairly good idea of the values of the other two.

1.5 Machine learning classification

1.5.1 Hierarchical Clustering

res.h <- hclust(dist(iris), method = "complete")
plot(res.h, hang = -1, cex = 0.6)

jpeg("images/results_hclust.jpg")
plot(res.h, hang = -1, cex = 0.6)
dev.off()
## png 
##   2
# table(iris$Species, 
#       cutree(res.h, k=3))

res.iris <- tibble(species = iris$Species, 
                   hclust = cutree(res.h, k=3) ) %>%
  mutate(hclust = case_when(hclust == 1 ~ "setosa",
                             hclust == 2 ~ "virginica",
                             hclust == 3 ~ "versicolor") )

Interpretation: with the hclust method, setosa and virginica are recognised almost every time, while versicolor is confused with virginica about half the time.

1.5.2 K-nearest neighbours

iris_split <- initial_split(iris, prop = 0.7)
iris_train <- iris_split %>% training()
iris_test <- iris_split %>% testing()

nearest_neighbor_kknn_spec <-
  nearest_neighbor() %>%
  set_engine('kknn') %>%
  set_mode('classification')

knn_mod <- nearest_neighbor_kknn_spec %>%
  fit(Species ~ ., iris_train) 
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris), iris_split)

knn.a <- accuracy(cbind(iris, predict(knn_mod, iris)), Species, .pred_class)$.estimate

res.iris$knn <- predict(knn_mod, iris)$.pred_class

knn.plot <- ggplot(res.iris, aes(x=knn, fill=species)) + 
  geom_bar() + 
  labs(title = "Knn model",
       x="Species from KNN model")
ggplotly(knn.plot)

1.5.3 Kmeans

fviz_nbclust(select_if(iris, is.numeric), kmeans, method = "wss")

res.km <- kmeans(select_if(iris, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )

# table(iris$Species, res.km)
km.a <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate

res.iris$km <- res.km

km.plot <- ggplot(res.iris, aes(x=km, fill=species)) + 
  geom_bar() + 
  labs(title = "Kmeans model",
       x="Species from kmeans model")
ggplotly(km.plot)
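
One caveat with the hardcoded ifelse() mapping above: kmeans() numbers its clusters arbitrarily, so the cluster-to-species assignment can be wrong from one run to the next, which deflates the reported accuracy. A sketch that maps each cluster to its majority species instead (the seed is my addition, for reproducibility):

```r
set.seed(42)  # assumption: any fixed seed works; chosen only for reproducibility
km_cluster <- kmeans(select_if(iris, is.numeric), centers = 3, nstart = 25)$cluster

# Map each cluster id to the most frequent true species within it,
# instead of relying on the arbitrary cluster numbering
majority <- sapply(split(iris$Species, km_cluster),
                   function(s) names(which.max(table(s))))
km_label <- factor(majority[as.character(km_cluster)],
                   levels = levels(iris$Species))

mean(km_label == iris$Species)  # in-sample agreement with the true species
```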

1.5.4 Xgboost

xgboost_parnsnip <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('classification')

res.xgboost <- xgboost_parnsnip %>%
  fit(Species ~ ., data = iris) %>%
  predict(iris) %>% 
  pull(.pred_class)

xgboost.a <- accuracy(cbind(iris, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost

xgboost.plot <- ggplot(res.iris, aes(x=xgboost, fill=species)) + 
  geom_bar() + 
  labs(title = "Xgboost model",
       x="Species from Xgboost model")
ggplotly(xgboost.plot)

1.5.5 Ranger

ranger_parnsnip <-
  rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('classification')

res.ranger <- ranger_parnsnip %>%
  fit(Species ~ ., data = iris) %>%
  predict(iris) %>% 
  pull(.pred_class)

ranger.a <- accuracy(cbind(iris, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger

ranger.plot <- ggplot(res.iris, aes(x=ranger, fill=species)) + 
  geom_bar() + 
  labs(title = "Ranger model",
       x="Species from Ranger model")
ggplotly(ranger.plot)

1.5.6 Model comparison

The objective is now to compare the four models in two ways: graphically, and with the accuracy of each model.

(knn.plot + km.plot)/ (xgboost.plot + ranger.plot)

The accuracy table:

data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
           accuracy = c(knn.a, km.a, xgboost.a, ranger.a))
##     model  accuracy
## 1     knn 0.9733333
## 2  kmeans 0.3200000
## 3 xgboost 1.0000000
## 4  ranger 0.9800000

Analysing the results with PCA:

res.pca.res <- PCA(res.iris %>%
  # note: "versicolor" must be spelled correctly, otherwise case_when returns NA
  mutate(species = case_when(species == "setosa" ~ 1,
                             species == "virginica" ~ 2,
                             species == "versicolor" ~ 3),
         hclust = case_when(hclust == "setosa" ~ 1,
                            hclust == "virginica" ~ 2,
                            hclust == "versicolor" ~ 3),
         knn = case_when(knn == "setosa" ~ 1,
                         knn == "virginica" ~ 2,
                         knn == "versicolor" ~ 3),
         km = case_when(km == "setosa" ~ 1,
                        km == "virginica" ~ 2,
                        km == "versicolor" ~ 3),
         xgboost = case_when(xgboost == "setosa" ~ 1,
                             xgboost == "virginica" ~ 2,
                             xgboost == "versicolor" ~ 3),
         ranger = case_when(ranger == "setosa" ~ 1,
                            ranger == "virginica" ~ 2,
                            ranger == "versicolor" ~ 3)), graph = FALSE)

plot(res.pca.res, choix = "var")

1.6 With dimension reduction

iris_pca <- PCA(iris %>% select(-Species), 
                ncp = 3, 
                graph = FALSE)$ind$coord %>%
  as_tibble() %>%
  mutate(Species =  case_when(iris$Species == "setosa" ~ 1,
                              iris$Species == "virginica" ~ 2,
                              iris$Species == "versicolor" ~ 3),
         Species = as_factor(Species))

1.6.1 Hierarchical Clustering

res.h <- hclust(dist(iris_pca), method = "complete")
plot(res.h, hang = -1, cex = 0.6)

# table(iris$Species,
#       cutree(res.h, k=3))

res.iris <- tibble(species = iris$Species, 
                   hclust = cutree(res.h, k=3) ) %>%
  mutate(hclust = case_when(hclust == 1 ~ "setosa",
                             hclust == 2 ~ "virginica",
                             hclust == 3 ~ "versicolor") )

1.6.2 K-nearest neighbours

iris_pca_split <- initial_split(iris_pca, prop = 0.7)
iris_pca_train <- iris_pca_split %>% training()
iris_pca_test <- iris_pca_split %>% testing()

nearest_neighbor_kknn_spec <-
  nearest_neighbor() %>%
  set_engine('kknn') %>%
  set_mode('classification')

knn_mod <- nearest_neighbor_kknn_spec %>%
  fit(Species ~ ., iris_pca_train) 
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_pca), iris_pca_split)

knn.a.pca <- accuracy(cbind(iris_pca, predict(knn_mod, iris_pca)), Species, .pred_class)$.estimate

res.iris$knn <- predict(knn_mod, iris_pca)$.pred_class

knn.plot.pca <- ggplot(res.iris, aes(x=knn, fill=species)) + 
  geom_bar() + 
  labs(title = "Knn model",
       x="Species from KNN model")
ggplotly(knn.plot.pca)

1.6.3 Kmeans

fviz_nbclust(select_if(iris_pca, is.numeric), kmeans, method = "wss")

res.km <- kmeans(select_if(iris_pca, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )

# table(iris$Species, res.km)
km.a.pca <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate

res.iris$km <- res.km

km.plot.pca <- ggplot(res.iris, aes(x=km, fill=species)) + 
  geom_bar() + 
  labs(title = "Kmeans model",
       x="Species from kmeans model")
ggplotly(km.plot.pca)

1.6.4 Xgboost

xgboost_parnsnip <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('classification')

res.xgboost <- xgboost_parnsnip %>%
  fit(Species ~ ., data = iris_pca) %>%
  predict(iris_pca) %>% 
  pull(.pred_class)

xgboost.a.pca <- accuracy(cbind(iris_pca, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost

xgboost.plot.pca <- ggplot(res.iris, aes(x=xgboost, fill=species)) + 
  geom_bar() + 
  labs(title = "Xgboost model",
       x="Species from Xgboost model")
ggplotly(xgboost.plot.pca)

1.6.5 Ranger

ranger_parnsnip <-
  rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('classification')

res.ranger <- ranger_parnsnip %>%
  fit(Species ~ ., data = iris_pca) %>%
  predict(iris_pca) %>% 
  pull(.pred_class)

ranger.a.pca <- accuracy(cbind(iris_pca, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger

ranger.plot.pca <- ggplot(res.iris, aes(x=ranger, fill=species)) + 
  geom_bar() + 
  labs(title = "Ranger model",
       x="Species from Ranger model")
ggplotly(ranger.plot.pca)

1.6.6 Model comparison

The objective is now to compare the four models in two ways: graphically, and with the accuracy of each model.

(knn.plot.pca + km.plot.pca)/ (xgboost.plot.pca + ranger.plot.pca)

The accuracy table:

data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
           accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
           accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca))
##     model  accuracy accuracy_PCA
## 1     knn 0.9733333    0.9466667
## 2  kmeans 0.3200000    0.2600000
## 3 xgboost 1.0000000    1.0000000
## 4  ranger 0.9800000    1.0000000

1.7 With scaling

iris_scaled <- scale(iris %>% select(-Species), center = FALSE, scale = TRUE) %>%
  as_tibble() %>%
  mutate(Species = iris$Species, 
         Species =  case_when(iris$Species == "setosa" ~ 1,
                              iris$Species == "virginica" ~ 2,
                              iris$Species == "versicolor" ~ 3),
         Species = as_factor(Species))

1.7.1 Hierarchical Clustering

res.h <- hclust(dist(iris_scaled), method = "complete")
plot(res.h, hang = -1, cex = 0.6)

# table(iris$Species,
#       cutree(res.h, k=3))

res.iris <- tibble(species = iris$Species, 
                   hclust = cutree(res.h, k=3) ) %>%
  mutate(hclust = case_when(hclust == 1 ~ "setosa",
                             hclust == 2 ~ "virginica",
                             hclust == 3 ~ "versicolor") )

1.7.2 K-nearest neighbours

iris_scaled_split <- initial_split(iris_scaled, prop = 0.7)
iris_scaled_train <- iris_scaled_split %>% training()
iris_scaled_test <- iris_scaled_split %>% testing()

nearest_neighbor_kknn_spec <-
  nearest_neighbor() %>%
  set_engine('kknn') %>%
  set_mode('classification')

knn_mod <- nearest_neighbor_kknn_spec %>%
  fit(Species ~ ., iris_scaled_train) 
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_scaled), iris_scaled_split)

knn.a.scaled <- accuracy(cbind(iris_scaled, predict(knn_mod, iris_scaled)), Species, .pred_class)$.estimate

res.iris$knn <- predict(knn_mod, iris_scaled)$.pred_class

knn.plot.scaled <- ggplot(res.iris, aes(x=knn, fill=species)) + 
  geom_bar() + 
  labs(title = "Knn model",
       x="Species from KNN model")
ggplotly(knn.plot.scaled)

1.7.3 Kmeans

fviz_nbclust(select_if(iris_scaled, is.numeric), kmeans, method = "wss")

res.km <- kmeans(select_if(iris_scaled, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )

# table(iris$Species, res.km)
km.a.scaled <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate

res.iris$km <- res.km

km.plot.scaled <- ggplot(res.iris, aes(x=km, fill=species)) + 
  geom_bar() + 
  labs(title = "Kmeans model",
       x="Species from kmeans model")
ggplotly(km.plot.scaled)

1.7.4 Xgboost

xgboost_parnsnip <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('classification')

res.xgboost <- xgboost_parnsnip %>%
  fit(Species ~ ., data = iris_scaled) %>%
  predict(iris_scaled) %>% 
  pull(.pred_class)

xgboost.a.scaled <- accuracy(cbind(iris_scaled, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost

xgboost.plot.scaled <- ggplot(res.iris, aes(x=xgboost, fill=species)) + 
  geom_bar() + 
  labs(title = "Xgboost model",
       x="Species from Xgboost model")
ggplotly(xgboost.plot.scaled)

1.7.5 Ranger

ranger_parnsnip <-
  rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('classification')

res.ranger <- ranger_parnsnip %>%
  fit(Species ~ ., data = iris_scaled) %>%
  predict(iris_scaled) %>% 
  pull(.pred_class)

ranger.a.scaled <- accuracy(cbind(iris_scaled, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger

ranger.plot.scaled <- ggplot(res.iris, aes(x=ranger, fill=species)) + 
  geom_bar() + 
  labs(title = "Ranger model",
       x="Species from Ranger model")
ggplotly(ranger.plot.scaled)

1.7.6 Model comparison

The objective is now to compare the four models in two ways: graphically, and with the accuracy of each model.

Top row: the knn and k-means models on the scaled data; bottom row: the xgboost and ranger models.

(knn.plot.scaled + km.plot.scaled)/ (xgboost.plot.scaled + ranger.plot.scaled)

The accuracy table:

data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
           accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
           accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca),
           accuracy_scaled = c(knn.a.scaled, km.a.scaled, xgboost.a.scaled, ranger.a.scaled))
##     model  accuracy accuracy_PCA accuracy_scaled
## 1     knn 0.9733333    0.9466667            0.96
## 2  kmeans 0.3200000    0.2600000            0.32
## 3 xgboost 1.0000000    1.0000000            1.00
## 4  ranger 0.9800000    1.0000000            0.98

1.8 With scaling and centering

iris_centered <- scale(iris %>% select(-Species), center = TRUE, scale = TRUE) %>%
  as_tibble() %>%
  mutate(Species = iris$Species, 
         Species =  case_when(iris$Species == "setosa" ~ 1,
                              iris$Species == "virginica" ~ 2,
                              iris$Species == "versicolor" ~ 3),
         Species = as_factor(Species))

1.8.1 Hierarchical Clustering

res.h <- hclust(dist(iris_centered), method = "complete")
plot(res.h, hang = -1, cex = 0.6)

# table(iris$Species,
#       cutree(res.h, k=3))

res.iris <- tibble(species = iris$Species, 
                   hclust = cutree(res.h, k=3) ) %>%
  mutate(hclust = case_when(hclust == 1 ~ "setosa",
                             hclust == 2 ~ "virginica",
                             hclust == 3 ~ "versicolor") )

1.8.2 K-nearest neighbours

iris_centered_split <- initial_split(iris_centered, prop = 0.7)
iris_centered_train <- iris_centered_split %>% training()
iris_centered_test <- iris_centered_split %>% testing()

nearest_neighbor_kknn_spec <-
  nearest_neighbor() %>%
  set_engine('kknn') %>%
  set_mode('classification')

knn_mod <- nearest_neighbor_kknn_spec %>%
  fit(Species ~ ., iris_centered_train) 
knn_last_fit <- last_fit(nearest_neighbor_kknn_spec, recipe(Species ~ ., data = iris_centered), iris_centered_split)

knn.a.centered <- accuracy(cbind(iris_centered, predict(knn_mod, iris_centered)), Species, .pred_class)$.estimate

res.iris$knn <- predict(knn_mod, iris_centered)$.pred_class

knn.plot.centered <- ggplot(res.iris, aes(x=knn, fill=species)) + 
  geom_bar() + 
  labs(title = "Knn model",
       x="Species from KNN model")
ggplotly(knn.plot.centered)

1.8.3 Kmeans

fviz_nbclust(select_if(iris_centered, is.numeric), kmeans, method = "wss")

res.km <- kmeans(select_if(iris_centered, is.numeric), centers = 3, nstart = 25)$cluster
res.km <- as.factor( ifelse(res.km == 1, "virginica", ifelse(res.km == 2, "versicolor", "setosa") ) )

# table(iris$Species, res.km)
km.a.centered <- accuracy(cbind(iris, res.km), Species, res.km)$.estimate

res.iris$km <- res.km

km.plot.centered <- ggplot(res.iris, aes(x=km, fill=species)) + 
  geom_bar() + 
  labs(title = "Kmeans model",
       x="Species from kmeans model")
ggplotly(km.plot.centered)

1.8.4 Xgboost

xgboost_parnsnip <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('classification')

res.xgboost <- xgboost_parnsnip %>%
  fit(Species ~ ., data = iris_centered) %>%
  predict(iris_centered) %>% 
  pull(.pred_class)

xgboost.a.centered <- accuracy(cbind(iris_centered, res.xgboost), Species, res.xgboost)$.estimate
res.iris$xgboost <- res.xgboost

xgboost.plot.centered <- ggplot(res.iris, aes(x=xgboost, fill=species)) + 
  geom_bar() + 
  labs(title = "Xgboost model",
       x="Species from Xgboost model")
ggplotly(xgboost.plot.centered)

1.8.5 Ranger

ranger_parnsnip <-
  rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('classification')

res.ranger <- ranger_parnsnip %>%
  fit(Species ~ ., data = iris_centered) %>%
  predict(iris_centered) %>% 
  pull(.pred_class)

ranger.a.centered <- accuracy(cbind(iris_centered, res.ranger), Species, res.ranger)$.estimate
res.iris$ranger <- res.ranger

ranger.plot.centered <- ggplot(res.iris, aes(x=ranger, fill=species)) + 
  geom_bar() + 
  labs(title = "Ranger model",
       x="Species from Ranger model")
ggplotly(ranger.plot.centered)

1.8.6 Model comparison

The objective is now to compare the four models in two ways: graphically, and with the accuracy of each model.

Top row: the knn and k-means models on the centred data; bottom row: the xgboost and ranger models.

(knn.plot.centered + km.plot.centered)/ (xgboost.plot.centered + ranger.plot.centered)

The accuracy table:

data.frame(model = c("knn", "kmeans", "xgboost", "ranger"),
           accuracy = c(knn.a, km.a, xgboost.a, ranger.a),
           accuracy_PCA = c(knn.a.pca, km.a.pca, xgboost.a.pca, ranger.a.pca),
           accuracy_scaled = c(knn.a.scaled, km.a.scaled, xgboost.a.scaled, ranger.a.scaled), 
           accuracy_centered = c(knn.a.centered, km.a.centered, xgboost.a.centered, ranger.a.centered))
##     model  accuracy accuracy_PCA accuracy_scaled accuracy_centered
## 1     knn 0.9733333    0.9466667            0.96              0.96
## 2  kmeans 0.3200000    0.2600000            0.32              0.26
## 3 xgboost 1.0000000    1.0000000            1.00              1.00
## 4  ranger 0.9800000    1.0000000            0.98              0.98

1.9 Conclusion

For the iris dataset, the best model for predicting the species is the Xgboost model, regardless of the transformations tested. The Kmeans model is a very poor predictor of the species. The PCA and centring transformations mainly affected the Kmeans model, reducing its accuracy; PCA also brought the Ranger model up to perfect in-sample accuracy.

1.10 Session

print( paste0( "System version : ", sessionInfo()$running, ", ", sessionInfo()$platform) )
## [1] "System version : Windows 10 x64 (build 19045), x86_64-w64-mingw32/x64 (64-bit)"
print( paste0( R.version$version.string, " - ", R.version$nickname ) )
## [1] "R version 4.2.0 (2022-04-22 ucrt) - Vigorous Calisthenics"
for (package in c( sessionInfo()$basePkgs, objects(sessionInfo()$otherPkgs) ) ) {
  print( paste0( package, " : ", package, "_", packageVersion(package) ) ) }
## [1] "stats : stats_4.2.0"
## [1] "graphics : graphics_4.2.0"
## [1] "grDevices : grDevices_4.2.0"
## [1] "utils : utils_4.2.0"
## [1] "datasets : datasets_4.2.0"
## [1] "methods : methods_4.2.0"
## [1] "base : base_4.2.0"
## [1] "broom : broom_1.0.1"
## [1] "corrplot : corrplot_0.92"
## [1] "dials : dials_1.1.0"
## [1] "dplyr : dplyr_1.0.10"
## [1] "factoextra : factoextra_1.0.7"
## [1] "FactoMineR : FactoMineR_2.6"
## [1] "forcats : forcats_0.5.2"
## [1] "ggplot2 : ggplot2_3.4.0"
## [1] "gtsummary : gtsummary_1.6.3"
## [1] "infer : infer_1.0.4"
## [1] "kknn : kknn_1.3.1"
## [1] "modeldata : modeldata_1.0.1"
## [1] "parsnip : parsnip_1.0.3"
## [1] "patchwork : patchwork_1.1.2"
## [1] "plotly : plotly_4.10.1"
## [1] "purrr : purrr_0.3.5"
## [1] "randomForest : randomForest_4.7.1.1"
## [1] "ranger : ranger_0.14.1"
## [1] "readr : readr_2.1.3"
## [1] "recipes : recipes_1.0.3"
## [1] "rsample : rsample_1.1.1"
## [1] "scales : scales_1.2.1"
## [1] "skimr : skimr_2.1.4"
## [1] "stringr : stringr_1.5.0"
## [1] "tibble : tibble_3.1.8"
## [1] "tidymodels : tidymodels_1.0.0"
## [1] "tidyr : tidyr_1.2.1"
## [1] "tidyverse : tidyverse_1.3.2"
## [1] "tune : tune_1.0.1"
## [1] "workflows : workflows_1.1.2"
## [1] "workflowsets : workflowsets_1.0.0"
## [1] "yardstick : yardstick_1.1.0"